Data deduplication cache comprising solid state drive storage and the like

ABSTRACT

Methods and systems for retrieving deduplicated data by a system having a first storage device and a second storage device to store deduplicated data are described, wherein data is retrievable from the first storage faster than data is retrievable from the second storage, including receiving a request from a client machine for deduplicated data and determining a location of the requested deduplicated data. If the data is in the second storage device, the method further includes retrieving the data, providing the data to the client machine, and moving the retrieved data to the first storage device. If the first storage is full, data may be moved to the second storage to make room for data to be stored in the first storage. One or more factors may be used to determine which data to move out of the first storage, if necessary. The first storage may be an SSD device.

RELATED APPLICATION

The present invention is a division of U.S. patent application Ser. No. 15/406,198, which was filed on Jan. 13, 2017, which claims the benefit of U.S. Provisional Patent Application No. 62/279,283, which was filed on Jan. 15, 2016, both of which are assigned to the assignee of the present invention and are incorporated by reference herein.

FIELD OF THE INVENTION

Data deduplication systems and methods and, more particularly, deduplication systems and methods including a deduplication cache comprising fast storage, such as solid state drive (“SSD”) storage, for example.

BACKGROUND OF THE INVENTION

Data deduplication reduces storage requirements of a system by removing redundant data, while preserving the appearance and the presentation of the original data. For example, two or more identical copies of the same document may appear in storage in a computer and may be identified by unrelated names. Normally, storage is required for each document. Through data deduplication, the redundant data in storage is identified and removed, freeing storage space for other data. Where multiple copies of the same data are stored, the reduction of used storage may become significant. Portions of documents or files that are identical to portions of other documents or files may also be deduplicated, resulting in additional storage reduction.

To implement data deduplication, in one example data blocks are hashed, resulting in hash values, also referred to as message digests, that are smaller than the original blocks of data and that uniquely represent the respective data blocks. Substantially collision free algorithms, such as 20 byte SHA-1 hash or MD5 hash algorithms, may be used, for example. Blocks with the same hash value are identified and only one copy of that data block is stored. Pointers to all the original locations of the blocks with the same data are stored in a table, in association with the hash value of the blocks. A stub file may be created on the client machine to replace the deduplicated value. A pointer may be provided in the stub file on the client machine to associate the stored data block on the deduplication system with the location or locations of the discarded data block or blocks on the client machine. The stub file may contain the hash, which may also be used to locate deduplicated data in the deduplication system.
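The hashing step described above can be illustrated with a minimal sketch. It assumes in-memory byte strings as data blocks and uses SHA-1 from Python's standard `hashlib` module; the function name `deduplicate` and the two dictionaries are illustrative only and are not taken from the specification.

```python
import hashlib

def deduplicate(blocks):
    """Keep one copy per unique block and remember where each copy occurred.

    `blocks` is a list of bytes objects; a block's index in the list stands
    in for its original location (an assumption made for this sketch).
    """
    store = {}      # digest -> the single stored copy of the block
    locations = {}  # digest -> list of original positions of that block
    for position, block in enumerate(blocks):
        digest = hashlib.sha1(block).hexdigest()  # 20-byte SHA-1 digest, as hex
        store.setdefault(digest, block)           # store only the first copy
        locations.setdefault(digest, []).append(position)
    return store, locations

store, locations = deduplicate([b"alpha", b"beta", b"alpha"])
```

Here the two identical `b"alpha"` blocks share one stored copy, while both original positions remain associated with the same digest.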

A remote deduplication system may be provided to perform deduplication of other machines, such as client machines, and to store deduplicated data. The deduplication system may provide a standard network file interface, such as Network File System (“NFS”) or Common Internet File System (“CIFS”), to the other machines. Data input to the deduplication system by the client machines is analyzed for data block redundancy. Storage space on or associated with the deduplication system is then allocated by the device to only the unique data blocks that are not already stored on or by the device. Redundant data blocks (those having a hash value that is the same as a data block that is already stored, for example) are discarded. This process can be dynamic, where deduplication is conducted while the data is arriving at the deduplication system, or delayed, where the arriving data is temporarily stored and then analyzed and deduplicated by the deduplication system. In one example, the data set is transmitted by the client machine storing the data to be deduplicated to the deduplication system before the redundancy can be removed. The client machine may mount network shared storage (“network share”) of the deduplication system to transmit the data. Data is transmitted to the deduplication system via the NFS, CIFS, or other protocol providing the transport and interface.

In another example, the deduplication system mounts network shared storage of the client machine to access the data to be deduplicated, as described in U.S. Patent Publication No. 2012/0089578, which is assigned to the assignee of the present invention and is incorporated by reference herein. Mounting the network shared storage of the client machine by the deduplication system avoids the need to transfer large amounts of data across a network, saving network bandwidth. An external data mover is not required.

The deduplication process is transparent to the client machines that are putting the data into the storage system. The users of the client machines do not, therefore, require special or specific knowledge of the working of the deduplication system.

When a user on a client machine accesses a document or other data from the client machine, the data will be looked up in the deduplication system according to index information, based on the pointer or hash in the stub file, for example. The stored data is returned to the user transparently, via NFS or CIFS, or other network protocols, by the deduplication system.

SUMMARY OF THE INVENTION

While tape storage may be read sequentially at a high rate, searching tape libraries for particular data may be slow. Retrieval of deduplicated data blocks from tape storage may also therefore be slow.

In accordance with embodiments of the invention, methods and systems are provided for storing deduplicated data that is more likely to be needed in a fast storage cache that is faster than other storage available on the system. In this way, the data stored in the fast storage cache may be more rapidly retrieved than data stored in other storage in the deduplication system. In one example, the fast storage cache (or “cache”) may be non-spinning storage while the other storage is spinning storage. The fast storage cache may be a solid state drive (“SSD”) storage, for example.

Data most likely to be needed may be data needed for read, write, and restore requests from client machines, for example. In one example, data that is considered to be more likely to be needed and therefore stored in the fast storage cache may be the data currently received by the deduplication system for deduplication. The determination that a data block is more likely to be needed may instead or in addition be based on other factors, such as a location of the data on the client machine. For example, a user may desire that data sent for deduplication that originates on a particular drive be stored in the cache for rapid retrieval, while data stored on other drives would not. Data that is stored in greater than a predetermined number of locations on the client machine may also be considered more likely to be needed. Data received from a particular client machine may also be prioritized for storage in the rapid storage. Other filters and/or a combination of filters may be used. An algorithm with weightings may be used by the deduplication system to determine whether to store received data in the cache.

Since the cache has a finite capacity, it may be necessary to move data out of the cache to make room for data that is more likely to be needed. In one example, the oldest data block stored in the cache is removed to make room for new data. In another example, the data block that has been in the cache for the longest time without being requested is removed from the cache. Removal from the cache may also be based on the number of past requests for data blocks and/or the types of requests for data blocks, such as read, write, and/or restore requests. Probabilities that a respective data block may be requested may be determined based on past requests, for example, and the data block with the lowest probability of being requested may be moved. Data may also be removed based on a location of respective data blocks in the client machine providing the data of the data block for deduplication, a number of locations of the data on the client machine providing the data of the data block for deduplication, and/or an identity of the client machine providing the data of the data block for deduplication. Data may also be moved from the cache, or not, based on a number of times respective data has been received for deduplication from more than one client machine. The determination of which data to remove to make room for a newly deduplicated data block may be based on one or more of these or other factors. An algorithm may be used to balance two or more of these and other factors, to determine which data block to move out of the cache. The algorithm may include weightings of the different factors, for example.
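One way such a weighted balancing of factors might look is sketched below. The specification does not give a formula, so the linear score, the field names, and the weight keys (`requests`, `idle`, `age`) are all assumptions chosen for illustration; only three of the factors named above are modeled.

```python
from dataclasses import dataclass

@dataclass
class CacheEntry:
    digest: str
    stored_at: float       # time the block entered the cache
    last_requested: float  # time of the most recent request for the block
    request_count: int     # number of past requests for the block

def pick_block_to_move(entries, now, weights):
    """Return the entry with the lowest weighted retention score.

    A higher score means "keep in the cache". Past requests raise the score;
    idle time and age in the cache lower it. The weights are hypothetical.
    """
    def score(entry):
        idle = now - entry.last_requested
        age = now - entry.stored_at
        return (weights["requests"] * entry.request_count
                - weights["idle"] * idle
                - weights["age"] * age)
    return min(entries, key=score)

entries = [
    CacheEntry("a", stored_at=0, last_requested=90, request_count=5),
    CacheEntry("b", stored_at=10, last_requested=20, request_count=1),
    CacheEntry("c", stored_at=50, last_requested=95, request_count=2),
]
balanced = pick_block_to_move(entries, now=100,
                              weights={"requests": 10, "idle": 1, "age": 0.1})
oldest_only = pick_block_to_move(entries, now=100,
                                 weights={"requests": 0, "idle": 0, "age": 1})
```

Setting all weights but `age` to zero reduces the sketch to the "oldest data block" policy, while the balanced weights select the block that has gone longest without a request.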

In accordance with a first embodiment of the invention, a method is disclosed for retrieving deduplicated data from a deduplication system comprising a first storage device and a second storage device for storing deduplicated data. The method comprises receiving a request from a client machine for deduplicated data and determining a location of the requested deduplicated data. If the data is stored in the second storage device, the data is retrieved, retrieved data is provided to the client machine, and the retrieved data is moved to the first storage device.

In accordance with a second embodiment of the invention, a deduplication system is disclosed comprising a first storage device and a second storage device, wherein data is retrievable from the first storage device faster than data is retrievable from the second storage device. At least one processing device is configured to receive a request from a client machine for deduplicated data and determine a location of the requested deduplicated data. If the data is in the second storage device, the at least one processing device is configured to retrieve the data, provide the retrieved data to the client machine, and move the retrieved data to the first storage.
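The retrieve-and-move behavior of these embodiments can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: both storage devices are modeled as Python dictionaries keyed by hash value, the class name `TwoTierStore` is hypothetical, and the "oldest block" rule is used when the first storage is full.

```python
class TwoTierStore:
    """Toy model of a first (fast) storage and a second (slower) storage."""

    def __init__(self, fast_capacity):
        self.fast = {}  # first storage; dict order tracks insertion order
        self.slow = {}  # second storage
        self.fast_capacity = fast_capacity

    def retrieve(self, digest):
        # Determine the location of the requested data.
        if digest in self.fast:
            return self.fast[digest]
        # The data is in the second storage: retrieve it (KeyError if unknown).
        data = self.slow.pop(digest)
        # If the first storage is full, move its oldest block to the second.
        if len(self.fast) >= self.fast_capacity:
            oldest = next(iter(self.fast))
            self.slow[oldest] = self.fast.pop(oldest)
        # Move the retrieved data to the first storage and return it.
        self.fast[digest] = data
        return data

tiers = TwoTierStore(fast_capacity=1)
tiers.slow = {"h1": b"first block", "h2": b"second block"}
first = tiers.retrieve("h1")
second = tiers.retrieve("h2")
```

After the second call, `h2` occupies the first storage and `h1` has been moved back to the second storage to make room.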

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a block diagram of an example of a data processing environment in which deduplication is performed by a deduplication system, in accordance with an embodiment of the invention;

FIG. 2 is a schematic representation of a hash table;

FIG. 3 is a more detailed block diagram of an exemplary deduplication system for use in the data processing environment of FIG. 1, in accordance with an embodiment of the invention;

FIG. 4 is a flowchart of an example of a deduplication routine to back up data in selected storage devices in the data processing environment of FIG. 1, including cache storage for rapid retrieval, in accordance with an embodiment of the invention; and

FIG. 5 is a flowchart of an example of a method of retrieving deduplicated data in the deduplication system of FIG. 1, in accordance with an embodiment of the invention.

DETAILED DESCRIPTION OF EMBODIMENTS

In accordance with embodiments of the invention, methods and systems are provided for storing in fast storage deduplicated data that is more likely to be needed, for rapid retrieval. In one example, the fast storage is solid state drive (“SSD”) storage. In another example, the fast storage is storage that is faster than other storage available in the system. In one example, data considered to be more likely to be needed is data that has most recently been received by the deduplication system for deduplication. Other factors that may be taken into consideration in determining that received data is more likely to be needed and is to be stored in the fast storage are discussed herein. If the fast storage is full, a block of data may be removed from the fast storage to make room for the new data. The block of data removed may be the oldest block of data, for example. Other factors or a combination of factors may also be considered in determining which block of data to move out of the fast storage, as discussed herein. While receipt and deduplication of data blocks by the deduplication system is described below, the data received and deduplicated, and/or the data moved out of the fast storage, if necessary, need not be in the form of data blocks.

FIG. 1 is a block diagram of an example of a data processing environment 100 in which deduplication may be performed by a deduplication system 140, in accordance with an embodiment of the invention. The system 100 comprises one or more client machines 110, 112, and 114, a network 120, and a deduplication system 140. While three client machines 110, 112, and 114 are shown in FIG. 1, more or fewer client machines may be included in the data processing environment 100.

Each client machine 110, 112, 114 may comprise hardware, software, or a combination of hardware and software. In one example, each client machine 110, 112, 114 comprises one or more computers or other devices, such as one or more personal computers (PCs), servers, and/or workstations. Alternatively, one or more of the client machines 110, 112, 114 may comprise a software application residing on a computer or other device.

To store data locally, the client machines 110, 112, 114 may also include local storage devices 152, 154, 156, respectively. The storage devices 152, 154, 156 may comprise any device that is capable of storing data, such as disk drives, tape drives, flash drives, and/or optical disks, etc. Alternatively, each client machine 110, 112, 114 may have access to a respective external storage device to store data. In this example, the client machines 110 and 114 have access to external storage 158 and 160, respectively. External storage may be provided instead of or along with the local storage 152, 154, 156.

The network 120 may comprise a single network or a number of different types of networks. The network 120 may be or include an intranet, a local area network (LAN), a wide area network (WAN), a Fiber Channel storage area network (SAN), Ethernet, and/or the Internet, for example. Communications may be conducted over the network 120 by means of IP protocols and/or Fiber Channel protocols, such as NFS or CIFS, for example.

In the example of FIG. 1, the deduplication system 140 comprises a processor 142, memory 144, and a primary storage device 146. The primary storage device 146 may comprise any device that is capable of storing data, such as a disk drive, tape drive, flash drive, and/or optical disk, etc. Alternatively or in addition, the deduplication system 140 may use an external storage device 147 and/or a storage device accessible over the network 120. The deduplication system 140 may also be configured as a system. The memory block 144 in FIG. 1 may represent RAM, ROM, and/or disk storage, for example.

The processor 142 may be a suitably programmed computer or other programmed processing device, for example. The processor 142 may comprise hardware configured to operate as described herein, such as hardware configured to operate under the control of software, and/or an application specific integrated circuit, for example. A combination of configured hardware and hardware operating under the control of software may also be used. Suitable software for controlling operation of the processor 142 may be stored in the memory 144, such as in ROM or on a disk, for example.

In accordance with an embodiment of the invention, the deduplication system 140 includes or has access to a deduplication cache 148 that allows for more rapid retrieval of stored data than is possible from the primary storage 146 and external storage 147, if provided. The deduplication cache 148 may be any memory device that allows for more rapid retrieval of data than the primary storage device 146 and/or the external storage device 147. For example, where the primary/external storage devices 146/147 are spinning storage devices, such as disk drives, the cache 148 is a faster storage than the spinning storage, such as a non-spinning storage. The cache 148 may comprise solid state devices (“SSD”), for example. In FIG. 1, the deduplication cache 148 is part of the deduplication system 140. In other examples an external cache 148 is provided. Both external and internal caches 148 may be provided. FIG. 3, discussed below, is a more detailed block diagram of an example of a deduplication system 140.

The deduplication system 140 deduplicates data provided by the client machines 110-114 to the deduplication system 140 across the network 120, and/or accesses data located on the local storage devices 152, 154, 156, 158 and/or 160 via the network, to deduplicate the data. Data may be accessed by mounting network shared storage on any or all of the local storage 152, 154, 156 and/or the external storage 158, 160, for example, as described in U.S. Patent Publication No. 2012/0089576. The deduplicated data is stored in the primary storage device 146 and/or the external storage 147.

In accordance with embodiments of the invention, certain deduplicated data, such as data more likely to be requested for retrieval by the client machines 110, 112, 114, is selectively stored in, and preferentially maintained in, the deduplication cache 148 for faster retrieval. In one example, it is assumed that the data most likely to be requested is or comprises most recently received data for deduplication. In another example, it is assumed that the data most likely to be requested is or comprises most recently requested data, data requested more than a predetermined number of times since being stored in the cache, or data requested more than a predetermined number of times in a predetermined time period. Requested data may include data requested for restore, read, and/or write operations, for example. The data stored and maintained in the cache may also be based on the type of past requests. For example, a data block for which more than a predetermined number of read and/or write requests have been received may be considered to be more likely to be needed than data blocks subject to restore requests, or vice versa.
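The test "requested more than a predetermined number of times in a predetermined time period" could be tracked along the following lines. This is a minimal sketch: the class name `RequestTracker`, the use of plain numeric timestamps, and the sliding-window interpretation of the time period are assumptions, not details from the specification.

```python
from collections import deque

class RequestTracker:
    """Counts requests per block inside a sliding time window."""

    def __init__(self, window, min_requests):
        self.window = window              # the predetermined time period
        self.min_requests = min_requests  # the predetermined number of requests
        self.times = {}                   # digest -> deque of request timestamps

    def record(self, digest, now):
        stamps = self.times.setdefault(digest, deque())
        stamps.append(now)
        # Drop timestamps that have fallen out of the window.
        while stamps and now - stamps[0] > self.window:
            stamps.popleft()

    def likely_needed(self, digest, now):
        stamps = self.times.get(digest, ())
        recent = sum(1 for t in stamps if now - t <= self.window)
        return recent > self.min_requests

tracker = RequestTracker(window=60, min_requests=2)
for t in (0, 10, 20):
    tracker.record("h1", now=t)
hot_now = tracker.likely_needed("h1", now=20)
hot_later = tracker.likely_needed("h1", now=75)
```

With three requests inside the 60-unit window the block qualifies at time 20, but by time 75 only one request remains in the window and it no longer does.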

In another example, data to be stored in the cache 148 and maintained in the cache may be or comprise data or data blocks stored in more than a predetermined number of locations on the client machine 110, 112, 114. For example, stub files on client machines 110-114 may be reviewed to determine if a data block is stored in more than a predetermined number of locations. A hash table, such as the hash table 200, discussed below with respect to FIG. 2, may also be reviewed, instead of or along with reviewing stub files, to determine whether a data block is stored in more than the predetermined number of locations in a client machine. If the deduplicated data is stored on tape, common file headers may be reviewed to determine whether a respective data block is stored in more than a predetermined number of locations.

In another example, the data to be stored in the cache 148 is data from a storage location on a client device of a client machine 110, 112, 114 that is designated by the client, such as data stored on a particular drive, for example. The identity of the client machine may also be a factor.

Other filters and/or a combination of filters may be used to determine whether received data should be stored in the cache 148 and/or whether to move a data block out of the cache to make room for the currently received data. The processor 142, for example, can determine whether received data is stored in the cache 148 or primary storage 146, and whether to move data out of the cache 148, under the control of algorithms stored in the memory 144, for example. The algorithms may include weightings for two or more respective factors.
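A weighted combination of the factors discussed above might be sketched as a simple score compared against a threshold. The factor names, the dictionary layout of `info`, the weights, and the threshold are all hypothetical; the specification states only that weightings for two or more factors may be used.

```python
def admission_score(info, weights, min_locations):
    """Sum the weights of the factors that apply to a received data block."""
    factors = {
        "recent": info["newly_received"],
        "designated_location": info["on_designated_drive"],
        "many_locations": info["location_count"] > min_locations,
        "priority_client": info["from_priority_client"],
    }
    return sum(weights[name] for name, applies in factors.items() if applies)

def should_cache(info, weights, min_locations, threshold):
    """Store in the cache when the weighted score reaches the threshold."""
    return admission_score(info, weights, min_locations) >= threshold

weights = {"recent": 0.5, "designated_location": 0.25,
           "many_locations": 0.125, "priority_client": 0.125}
block = {"newly_received": True, "on_designated_drive": False,
         "location_count": 4, "from_priority_client": False}
score = admission_score(block, weights, min_locations=3)
decision = should_cache(block, weights, min_locations=3, threshold=0.6)
```

In this example the block is newly received and appears in more than three locations, so its score of 0.625 clears the 0.6 threshold and it would be placed in the cache rather than in primary storage.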

Data may be stored in the storage devices 152-160 of the client machines 110, 112, 114, in the form of data files, which may in turn be organized and grouped into folders. A folder is sometimes referred to as a “directory,” and a directory within another directory is sometimes referred to as a “sub-directory.” Alternatively, data may be stored using other data structures.

Deduplication functionality may be performed by any technique known in the art. In one example, the data is deduplicated by dividing the data stored on the storage devices 152-160 into data blocks, or segments of data, and processing the data blocks. The processor 142 of the deduplication system 140 reads each data block and computes a message digest or digital fingerprint, such as a hash value, of each data block. As is known in the art, message digests are smaller than the original respective data blocks and uniquely represent each data block.

In one example, the deduplication system 140 converts the files on the client machines 110, 112, 114 containing data that has been deduplicated (“deduplicated data”) into stub files. The data or data blocks in respective files may be replaced with indicators to the locations of the deduplicated data files on the storage devices 146, 147, 148 by the deduplication system 140 directly or via an agent on the respective client machine 110-114, for example. Since the storage locations of the deduplicated data may be changed, however, as discussed herein, the indicator to the current location of the deduplicated data could need to be changed. Respective indicators may be changed, if needed, by the deduplication system 140 directly or via the agent on the respective client machine 110, 112, 114, for example. The current location of a respective deduplicated data file may be stored in a hash table, such as the hash table 200 discussed below with respect to FIG. 2, for example, instead of or in addition to the stub file. The current location of a respective deduplicated data file may also be found in a directory of a respective storage device 146, 147, 148, for example.

Alternatively or in addition, a hash value of the respective data may be placed in the stub files. The hash value may be used by the client machine to request the data corresponding to the deduplicated data file, as discussed further below. In either case, the original appearance (directory and file structure) of the data on the client machines 110, 112, 114 is preserved. Storage requirements are reduced and available storage space is increased on the storage devices 152, 154, 156, 158, 160 of the client machines 110, 112, 114. The indicators may be pointers, for example. If a client machine 110, 112, 114 uses a UNIX operating system, the pointers may be symbolic file links.

Hash values are generated by substantially collision free algorithms which generate a probabilistically unique hash value based on inputted data, as is known in the art. Examples of substantially collision free algorithms are the SHA-1 algorithm and the MD5 (message digest 5) algorithm. Either may be used, for example, as described in U.S. Pat. Nos. 7,055,008, 7,962,499, and U.S. Patent Application Publication No. 2007/0198659, which are assigned to the assignee of the present invention and are incorporated by reference herein. U.S. Pat. No. 7,954,157, which is also assigned to the assignee of the present invention and is incorporated by reference herein, describes examples of other techniques that may also be used. Other substantially collision free algorithms may also be used.

The hash values may be stored in a database of hash values by the deduplication system 140, such as “hash” table 200. An example of a hash table 200 is shown in FIG. 2. The hash table 200 may be stored in the memory 144, the primary storage 146, and/or the external storage 147, or another memory/storage device. In the schematic representation of the hash table 200 of FIG. 2, hash values 202, the one or more locations 204 on the client machine 110, 112, 114 where the corresponding data block was stored, and the one or more locations 206 on the memory 144, primary storage 146, external storage 147, and/or memory cache 148 of the deduplication system 140 where a respective data block is stored, for example, are correlated. The hash table 200 may also contain metadata 208 identifying when (date and optionally time) each data block corresponding to a respective hash value was stored in the hash table.

The column 204 including the location on the client machine 110, 112, 114 is optional. If the stub files on the client machines contain indicators to the location of the data in storage on or associated with the deduplication system 140, and the location of the data block is changed, the location 204 of the data block on the respective client machine 110-114 may be used to locate the stub file and change the indicator on the stub file to indicate the new location of the data block. If indicators are not provided in the stub files, then the column 204 is not needed for this purpose. It may still be provided for housekeeping and other purposes.

When a new (unique) data block is added to the cache 148, the hash value of the data block is added to the hash table 200. The location 204 of the actual data block in the storage of the client machine 110, 112, 114 where the original data block came from is associated with the data block in the hash table 200. The hash table 200 may also record when a respective data block is stored or removed from the cache, in the metadata 208, for example. The cache 148 may also record where and when a respective data block is stored in the hash table 200. The primary storage 146 may similarly record when a data block is stored or removed from the primary storage.

If the hash value of a current data block being deduplicated matches a hash value already in the hash table 200, then the data block is not unique and has already been deduplicated and stored by the deduplication system 140. Another copy of the data block need not be stored on the deduplication system 140, saving storage space on the deduplication system 140 and network bandwidth, if it would have been necessary to transfer the data block across the network 120 to the deduplication system 140 for storage. Instead, the file on the client machine 110, 112, 114 containing the deduplicated data block may be replaced by a stub file containing a pointer or symbolic link to the location of the deduplicated data block stored in the storage of the deduplication system 140 and/or the hash value of the data block, as discussed above. In either case, since the stub file takes up less storage space than the original data file, storage space is freed for later use by the respective client machine 110, 112, 114. The hash table 200 may be updated to include the location on the respective client machine 110, 112, 114 where the current data block is located in column 204, for example, if such information is stored in the hash table.
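The table update described in the preceding paragraphs can be sketched with one row per hash value. The field names mirror the columns 202, 204, 206 and the metadata 208 of FIG. 2, but the `HashTableRow` class, the `record_block` function, and the location strings are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional

@dataclass
class HashTableRow:
    hash_value: str                                             # column 202
    client_locations: List[str] = field(default_factory=list)   # column 204
    system_locations: List[str] = field(default_factory=list)   # column 206
    stored_at: Optional[datetime] = None                        # metadata 208

def record_block(table: Dict[str, HashTableRow], hash_value: str,
                 client_location: str, system_location: str,
                 when: datetime) -> bool:
    """Update the table for one block; return True if the block is unique."""
    row = table.get(hash_value)
    is_new = row is None
    if is_new:
        # Unique block: add a row and note where the single copy is stored.
        row = HashTableRow(hash_value, system_locations=[system_location],
                           stored_at=when)
        table[hash_value] = row
    # In both cases, remember where the block appeared on the client machine.
    row.client_locations.append(client_location)
    return is_new

table: Dict[str, HashTableRow] = {}
when = datetime(2017, 1, 13)
first_seen = record_block(table, "abc", "client110:/docs/a.txt", "cache148:0", when)
seen_again = record_block(table, "abc", "client112:/docs/b.txt", "cache148:1", when)
```

The second call finds a matching hash value, so no second copy is recorded in column 206; only the additional client location is added to column 204.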

The size of each data block may be fixed or variable, depending on the operating system or the system administrator's preferences. Fixed blocks are easier to manage, but may waste space. Variable sized blocks make a better use of the available backup space, but are somewhat more difficult to keep track of. In addition, the size of the blocks may vary from file to file. For instance, one option may be to have each file contain a set number of blocks, N. The size of each block from a larger file of size S1 would be S1/N and the size of each block from a smaller file of size S2 would be S2/N, where S1/N>S2/N. A special case of a variable-sized block is the whole file itself (where N=1, for example), however, it is likely more advantageous to have smaller-sized blocks in order to avoid having to save large files that change only slightly between backups. In addition, the size of the blocks may be limited by the requirements of the specific algorithm used to create the message digest.
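The "set number of blocks, N" option can be illustrated with a short sketch. The function name `split_into_n_blocks` is hypothetical, and spreading any remainder bytes over the leading blocks is an assumption made here so that exactly N blocks are always produced when the file size is not a multiple of N.

```python
def split_into_n_blocks(data: bytes, n: int):
    """Split `data` into exactly `n` blocks of size about len(data) / n."""
    base, extra = divmod(len(data), n)
    blocks, start = [], 0
    for i in range(n):
        # The first `extra` blocks carry one additional byte each.
        size = base + (1 if i < extra else 0)
        blocks.append(data[start:start + size])
        start += size
    return blocks

large_blocks = split_into_n_blocks(bytes(1000), 4)  # S1 = 1000, N = 4
small_blocks = split_into_n_blocks(bytes(100), 4)   # S2 = 100,  N = 4
uneven = split_into_n_blocks(b"0123456789", 4)
```

The larger file yields blocks of S1/N = 250 bytes and the smaller file blocks of S2/N = 25 bytes, while a 10-byte input is divided into blocks of 3, 3, 2, and 2 bytes.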

FIG. 3 is a more detailed block diagram of an exemplary deduplication system 140 that may be used in the data processing system 100 of FIG. 1 to implement embodiments of the invention. The processor 142, primary storage 146, external storage 147, and memory cache 148 of FIG. 1 are shown. The memory 144 of FIG. 1 is shown including RAM 144 a, ROM 144 b, and disk storage 144 c. Also shown are an interface 172 and a control module 174. The interface 172 provides a communication gateway through which data may be transmitted between the processor 142 and the network 120. The interface 172 may comprise any one or more of a number of different mechanisms, such as one or more SCSI cards, enterprise systems connection cards, fiber channel interfaces, modems, or network interfaces.

The processor 142 controls the operations of the deduplication system 140, including storing and accessing deduplicated data from the primary storage 146, storing data in and accessing data from the memory 144, and causing data to be retrieved and transmitted upon request to the client machines 110, 112, 114. In one example, the control module 174 directs the access and deduplication of data from the client machines 110, 112, 114, including management of the hash table 200. The processor 142 may perform these operations along with or instead of the control module 174. The memory 144 may comprise random-access memory (RAM), for example. The memory 144 may be used by the processor 142 to store data on a short-term basis. In this example, the deduplication system 140 comprises a computer, such as an Intel processor-based personal computer. The control module 174 may comprise software run by the processor 142 or may be a separate processing device. The control module 174 may also comprise an application specific integrated circuit, for example.

The primary storage 146 may comprise one or more disk drives, and/or any other appropriate device capable of storing data, such as tape drives, flash drives, optical disks, etc. The primary storage 146 may perform data storage operations at a block-level or at a file-level. The processor 142 and the primary storage 146 may be connected by one or more additional interface devices. In an alternative example, the primary storage 146 may comprise a storage system separate from the deduplication system 140. In this case, the primary storage 146 may comprise one or more disk drives, tape drives, flash drives, optical disks, etc., and may also comprise an intelligent component, including, for example, a processor, a storage management software application, etc.

The control module 174 may direct the receipt or access of data from the client machines 110, 112, 114 and cause the data to be deduplicated and stored in the primary storage device 146, external storage 147, and/or cache 148. To facilitate the storage of the deduplicated data blocks, the control module 174 may maintain one or more databases in the primary storage 146. For example, the control module 174 may create and maintain a file object database 176 in the primary storage 146. The file object database 176 may be maintained in the form of a file directory structure containing files and folders containing pointers to the locations in the deduplication system 140 of the data blocks in each file in the hash table 200, instead of or in addition to the data in column 206 of the hash table 200, for example. In addition, the file object database 176 may comprise a relational database or any other appropriate data structure of the data blocks stored on the primary storage 146. The directories, files, and folders may be based on the corresponding directories, files, and folders on the client machines 110, 112, 114 containing the data sent to the deduplication system 140. The control module 174 may also maintain the hash table 200. The processor 142 may be configured to provide any of these functions, instead of the control module 174.

The control module 174 and/or the processor 142 of the deduplication system 140 may cause data to be backed up in accordance with a schedule comprising one or more backup policies established by the respective client machines 110, 112, 114, for example. The backup policies may specify parameters including the storage device, directory, or file to be backed up; the backup time; and/or the backup frequency, etc., for each client machine 110, 112, 114. To enable a user to establish such backup policies, an agent may be provided by the deduplication system 140 to the client machines 110, 112, 114, via the network 120, for operation on the client machine, for example. The agent on each client machine 110, 112, 114 may generate a graphical user interface (“GUI”) for use by each respective client machine to facilitate the initial setup and selection of parameters for the backup policy, and to transmit the policy to the deduplication system 140. The deduplication system 140 may further coordinate the prioritization, scheduling, and other aspects of one or more clients' respective backup policies. This enables efficient use of the resources of the deduplication system 140. The setting of backup policies is described in more detail in U.S. Patent Publication No. 2012/0089578, which is assigned to the assignee of the present application and is incorporated by reference herein.

The GUI on the client machines 110, 112, 114 may also be used by respective clients to designate which types of data from their client machine should be stored in the cache 148, such as the current data block provided for deduplication, data from one or more particular storage locations, and/or data appearing more than a predetermined number of times on the respective client machine. The client could also determine the predetermined number, for example. The GUI may also be used by the client to select the one or more factors to be used to determine which data blocks to move out of the cache if room is needed in the cache. Options may be presented to the client in a drop-down menu, for example.

As discussed above, if the data block is found to be unique (the hash value is not already stored in the hash table 200) by the processor 142, then an identical data block has not already been received, hashed, deduplicated, and stored by the deduplication system 140. In accordance with an embodiment of the invention, if there is room in the cache 148, the data block is then stored on the cache 148. If there is no room in the cache 148, then the oldest data block is removed from the cache 148 to make room for the new data block. In one example, data blocks are removed from the cache 148 in a first-in, first-out (“FIFO”) manner in which the oldest data block (first-in) is removed (first-out) to make room for a new data block. In the hash table 200 of FIG. 2, the oldest data block in the cache 148 may be the lowest (oldest) data block in the hash table 200, but that is not necessarily the case. The metadata 208 associated with respective data blocks may indicate the date and optionally the time a hash value was stored, which may be used to determine the oldest data block in the cache 148, for example.
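The FIFO behavior described above can be illustrated with a minimal Python sketch. The class name `FifoCache`, the block-count capacity, and the plain dictionary standing in for the slower storage are assumptions made for illustration only; the specification does not prescribe any particular implementation.

```python
from collections import OrderedDict


class FifoCache:
    """Hypothetical fast-storage cache that evicts in first-in, first-out order."""

    def __init__(self, capacity: int, slow_storage: dict):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity          # assumed: capacity counted in blocks
        self.blocks = OrderedDict()       # digest -> block, oldest entry first
        self.slow_storage = slow_storage  # stands in for the primary storage

    def store(self, digest: str, block: bytes) -> None:
        if digest in self.blocks:
            return  # an identical block is already cached
        while len(self.blocks) >= self.capacity:
            # Remove the oldest block (first-in) and keep it in slow storage.
            old_digest, old_block = self.blocks.popitem(last=False)
            self.slow_storage[old_digest] = old_block
        self.blocks[digest] = block


primary = {}
cache = FifoCache(capacity=2, slow_storage=primary)
cache.store("a", b"1")
cache.store("b", b"2")
cache.store("c", b"3")  # cache is full, so block "a" is moved out first
```

Here insertion order stands in for the stored date and time kept in the metadata; a real system would consult that metadata instead.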

In another example, a modified FIFO method may be used, in which other factors are added to the FIFO method. This may be referred to as prioritized queuing. In one example of a modified FIFO method, a subsequently requested data block may be treated as a more recently added data block. For example, instead of selecting the oldest data block stored in the cache 148 for movement out of the cache to the primary storage 146 or other storage, the data block selected to be moved may be the oldest data block that has not been requested, or has fewer than a predetermined number of requests, since it has been stored in the cache or within a predetermined period of time. In one example, a count of requests may be kept for each data block which is incremented (or decremented) based on each request for the data block and/or other factors, for example. The count may be maintained in the metadata 208 in the hash table 200 or another table, for example, by the deduplication system 140. In one example, the control module 174 maintains the count. In another example, the processor 142 maintains the count. The data block(s) with the lowest current count would have the lowest priority value and would then be moved to the slow storage, such as the primary storage 146, when additional space is needed in the cache 148.
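One possible reading of this modified FIFO selection is sketched below in Python. The names `CacheEntry`, `select_block_to_move`, and `min_requests`, and the fallback used when every block has reached the threshold, are illustrative assumptions rather than details taken from the specification.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    stored_order: int       # lower value = stored in the cache earlier
    request_count: int = 0  # requests since the block was stored in the cache


def select_block_to_move(entries: dict, min_requests: int = 1) -> str:
    """Pick the oldest block with fewer than `min_requests` requests.

    Assumed fallback: if every block has reached the threshold, pick the
    block with the lowest count, breaking ties in favor of the oldest.
    """
    cold = [d for d, e in entries.items() if e.request_count < min_requests]
    if cold:
        return min(cold, key=lambda d: entries[d].stored_order)
    return min(
        entries,
        key=lambda d: (entries[d].request_count, entries[d].stored_order),
    )


entries = {
    "a": CacheEntry(stored_order=0, request_count=3),  # oldest, but requested
    "b": CacheEntry(stored_order=1),
    "c": CacheEntry(stored_order=2),
}
victim = select_block_to_move(entries)  # "b": oldest block never requested
```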

In another example, the data block with the lowest probability of being requested may be selected to be moved. The probability that respective data blocks will be requested may be determined by the processing device 140 and/or the control module 174 based on the number of requests for each data block since it has been stored in the cache 148, or within a predetermined period of time, for example. The count maintained in the metadata 208 of the hash table 200, discussed above, may be used.

The types of requests (read, write, and/or restore) for respective data blocks, which may also be stored in the metadata 208, may be considered, with priority given to data blocks subject to a particular type of request.

Other factors that may be considered by the processing device 140 and/or the control module 174 in selecting a data block to move out of the cache 148 are the locations of respective data blocks in the client machine providing the data of the data block for deduplication and/or a number of locations of the data on the client machine providing the data of the data block for deduplication. The particular locations of data blocks on the client machines 110, 112, 114 and the number of locations the same data block is stored on the respective client machine 110, 112, 114 may be determined in manners known in the art. For example, as discussed above, the locations of data blocks that have been deduplicated may be found in column 204 in the hash table 200. In another example, an agent on the respective client machine may be used to determine how many times a data block is stored on a client machine. In another example, the deduplication request sent by a client machine may include the location of the data block on the client machine. If the deduplicated data is stored on tape, common file headers may be reviewed to determine the number of times a respective data block is stored on the client machine.

The identity of the client machine 110, 112, 114 providing the data of the data block for deduplication may also be a factor in determining whether to store a received data block in the cache 148. The identity of the client machine may be provided in the request to deduplicate the data and stored in the hash table 200 in the metadata 208, for example.

Another example of a factor is a number of times respective data blocks have been received from multiple client machines 110, 112 and/or 114. This is another indication of the importance of a data block to more than one client.

As noted above, different client machines 110, 112, 114 may request that different factors be given priority, via the GUI, for example.

FIG. 4 is a flow chart of an example of a method of storing deduplicated data blocks in the data processing system 100 of FIG. 1, in accordance with an embodiment of the invention. The method may be performed by the deduplication system 140, under the control of the processor 142 and control module 174, for example. In this example, the control module 174 and the processor 142 of the deduplication system 140 are configured to operate based on the assumption that the most recently deduplicated data blocks are most likely to be requested (restored, retrieved, read, and/or written to) by client machines 110, 112, 114. In other examples, other prioritized queuing methods based on other factors, such as those discussed above, may be applied.

In Step 302, data is received from or accessed at a respective client machine 110-114, such as the client machine 110, for example. The control module 174 divides the data in each file into blocks of a fixed or variable size, using the memory 144 to store the blocks, for example, in Step 304. The control module 174 generates message digests (hash values) of the blocks stored in the memory 144, in Step 306. The message digests may be generated by a substantially collision free algorithm, for example, as discussed above.
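Steps 304 and 306 can be illustrated with a short Python sketch. The fixed 4096-byte block size and the choice of SHA-256 as the digest algorithm are assumptions for illustration; the specification only calls for blocks of a fixed or variable size and a substantially collision free algorithm.

```python
import hashlib


def split_into_blocks(data: bytes, block_size: int = 4096) -> list:
    """Divide data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def message_digest(block: bytes) -> str:
    """Hash value of a block; SHA-256 is an assumed choice of algorithm."""
    return hashlib.sha256(block).hexdigest()


data = b"A" * 8192 + b"B" * 4096  # two identical blocks, then a different one
blocks = split_into_blocks(data)
digests = [message_digest(b) for b in blocks]
unique_digests = set(digests)     # identical blocks share a single digest
```

Because the first two blocks are identical, they produce the same digest, which is what allows the second copy to be recognized as redundant.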

The control module 174 determines whether the message digest of a currently deduplicated data block is already stored in the hash table 200, in Step 308, by comparing the hash value of that data block to the digest values 202 already stored by the deduplication system 140 in the hash table 200, for example.

In this example, if the digest value of a current data block is already stored in the hash table 200, which indicates that an identical data block has already been deduplicated and stored, it is determined whether the current data block is already stored in the cache 148, in Step 310. The cache 148 may comprise one or more SSDs, for example, from which data may be retrieved faster than the other storage devices used by the deduplication system 140. The cache 148 may comprise other types of storage devices that are faster than the primary storage device 146 and/or the external storage 147, for example. The control module 174 in this example may determine that the data block is stored in the cache 148 based on searching of the hash table 200 or a directory of the cache, or by other methods known in the art, for example.

If the current data block is already stored in the cache 148 (Step 310), in this example the processor 142 causes the current data block to be replaced by a stub file on the client machine. The stub file in this example contains an indicator to the location of the already stored data block in the cache 148 or primary/external storage 146 and may also include the message digest (hash value) of the block, although that is not required. The hash table 200 may be updated by the processor 142 or the control module 174 to include the location of the newly deduplicated data block on the client machine 110, 112, 114 from which the current data block was originally stored, in association with the hash value, if such information is maintained in column 204, for example.

If it is determined in Step 310 that the current data block is not already stored in the cache 148, in Step 316 the control module 174 determines whether the cache is full, in a manner known in the art. For example, the processor 142 or the control module 174 may keep track of the filling of the cache 148 so that the processor or control module can determine whether there is available room in the cache. If the processor 142 stores such information, then the control module 174 may send a message to the processor 142 asking whether there is room in the cache for the current data block.

If the message digest of the current data block is not stored in the hash table 200 and the cache 148 is not full, then the current data block is stored in the cache by the control module 174, in Step 318. The hash table 200 may then be updated by the control module 174 to indicate the location 206 of the current data block in the cache 148, in association with the message digest (hash value) of the data block, in Step 320. Optionally, the date the data block is stored in the cache 148 may also be stored in the hash table 200, such as in the metadata 208, for example, as discussed above. The time the data block was stored may also be stored in the hash table 200. The date and time data blocks are stored may also be stored in another location.

If the control module 174 determines that the cache 148 is full in Step 316, then, in this example, where a FIFO method is used, the oldest data block in the cache (the data block stored for the longest period of time) is moved by the control module 174 or the processor 142 to the primary storage 146, in Step 322. The oldest data block may be identified based on metadata 208 in the hash table 200 and/or the directory of the cache 148, which may identify when each block is stored in the cache, for example. It is noted that while the cache 148 is being referred to as “full,” the SSD in the cache may not be full, but the empty space may not be available for storage. This is because SSDs may maintain extra storage that is not available for storage of new data to maintain write performance, as is known in the art. As discussed above, if a modified FIFO or prioritized queuing method is used, the oldest data block is not necessarily removed.

If the processor 142 determines that the oldest data block or blocks (depending on the size of the incoming data block) has been successfully moved to the primary storage 146, then the current data block is moved to the cache 148, in Step 318, and the hash table 200 is updated to indicate the current location 206 in Step 320, as discussed above. The indicator(s) in the stub file(s) corresponding to the original data blocks may also be updated to point to the new location of the data block, by the deduplication system 140, via an agent, for example.

If the processor 142 determines that the oldest data block has not been successfully moved, then the current data block is stored in the primary storage 146 or the external storage 147, in Step 326, because there is no room in the cache 148.

Returning to Step 308, if it is determined that the message digest is not already in the hash table 200, then a data block identical to the current data block has not already been received or accessed for deduplication. In this case, the message digest is stored in the hash table 200, along with an indicator to the location 206 of the corresponding data block in the cache 148 and the location 204 of the data block on the client, in Step 314. The method 300 proceeds to Step 316 and continues, as discussed above.
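The hash table bookkeeping of Steps 308 and 314 might be sketched in Python as follows. The `HashTableEntry` fields loosely mirror columns 204 and 206 and the metadata 208, but the class, the function `record_block`, and the string-valued locations are hypothetical and shown only to make the lookup-then-insert logic concrete.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class HashTableEntry:
    client_locations: list   # cf. column 204: where the block sits on clients
    storage_location: str    # cf. column 206: where the stored copy is kept
    metadata: dict = field(default_factory=dict)  # cf. metadata 208


def record_block(hash_table: dict, digest: str,
                 client_location: str, storage_location: str) -> bool:
    """Return True if the block is new and must be stored, False if duplicate."""
    entry = hash_table.get(digest)  # Step 308: is the digest already known?
    if entry is not None:
        # Duplicate: only the additional client location is recorded.
        entry.client_locations.append(client_location)
        return False
    # Step 314: a new digest is stored with its locations and a timestamp.
    hash_table[digest] = HashTableEntry(
        client_locations=[client_location],
        storage_location=storage_location,
        metadata={"stored_at": datetime.now(timezone.utc)},
    )
    return True


table = {}
first = record_block(table, "d1", "C:/docs/a.txt", "cache")      # new block
second = record_block(table, "d1", "C:/docs/copy.txt", "cache")  # duplicate
```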

As discussed above, instead of always storing the currently received data block in the cache 148 in Step 318, if it is not already stored there, one or more other factors may be taken into consideration in determining whether to store a currently received data block in the cache or in the primary storage 146. For example, only data blocks received from a particular one or a few of the client machines 110, 112, 114 may be stored in the cache 148. In another example, data blocks stored in more than a predetermined number of locations on the client machine 110, 112, 114 sending the data block for deduplication may be stored in the cache 148. In another example, data blocks received from a particular storage location on a respective client machine 110, 112, 114, such as from a particular drive, for example, may be stored in the cache 148. As discussed herein, the processor 142 and/or the control module 174, under the control of an algorithm stored in the disk storage 144 c, for example, may determine whether to store the current deduplicated data block in the cache 148 or the primary storage 146, based on two or more factors. The algorithm may include weightings for respective factors.

In addition, as described above, in other examples, the oldest data block may not be moved out of the cache 148 to make room for a current data block. One or more other factors may be considered in addition to or instead of moving the oldest data block. The flowchart 300 of FIG. 4 may be readily modified by one of ordinary skill in the art to apply one or more other factors in determining whether to store a current data block in the cache 148 and to determine which data block in the cache to move out of the cache. As discussed herein, the processor 142 and/or the control module 174, under the control of algorithms stored in the disk storage 144 c, for example, may determine which data block in the cache 148 to move out of the cache 148 when necessary to make room for a new data block, based on two or more factors. The algorithm may include weightings for respective factors.
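A weighted combination of factors could take many forms; the Python sketch below shows one simple linear scoring scheme. The factor names, the weight values, and the convention that a higher score marks a better candidate to move out of the cache are all hypothetical and are not specified in this description.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    age_seconds: float     # how long the block has been in the cache
    request_count: int     # number of past requests for the block
    client_locations: int  # number of locations of the block on the client


# Hypothetical weightings: age raises the score, while frequent requests
# and many client locations lower it (such blocks stay in the cache).
WEIGHTS = {"age": 1.0, "requests": -50.0, "locations": -20.0}


def eviction_score(stats: BlockStats) -> float:
    return (WEIGHTS["age"] * stats.age_seconds
            + WEIGHTS["requests"] * stats.request_count
            + WEIGHTS["locations"] * stats.client_locations)


def select_by_weighted_score(stats_by_digest: dict) -> str:
    """Return the digest of the block with the highest eviction score."""
    return max(stats_by_digest, key=lambda d: eviction_score(stats_by_digest[d]))


stats = {
    "old_hot": BlockStats(age_seconds=600, request_count=20, client_locations=1),
    "new_cold": BlockStats(age_seconds=100, request_count=0, client_locations=1),
    "mid": BlockStats(age_seconds=300, request_count=1, client_locations=4),
}
victim = select_by_weighted_score(stats)
```

With these assumed weights the oldest block is not the one selected, because its high request count outweighs its age; this is the kind of departure from pure FIFO that the weightings allow.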

As discussed above, respective client machines 110, 112, 114 may instruct the deduplication system 140 to use particular criteria to determine whether to store a particular type of data block in the cache 148 and/or to move a particular type of data block from the cache, when necessary. Options may be presented in the GUI, for example, as discussed above, during a set up procedure or registration procedure with the deduplication system 140 or at a later time, for example.

FIG. 5 is a flowchart of an example of a method 400 for retrieval of a deduplicated data block by the processor 142 of the deduplication system 140 in response to a request from a client machine 110, 112, 114, such as the client machine 110, for example. The request may be a restore, read, or write request, for example. The request is received by the interface 172 of the deduplication system 140 from a client machine 110, 112, 114, via the network 120, in Step 402. The request may include the message digest of the data block, if it was included in the stub file. The control module 174, for example, may search the hash table 200 for a matching message digest. If there is a match, the control module 174 identifies the storage location of the data block corresponding to the message digest, in column 206, for example. Alternatively, the data block may be located on the client machine 110, 112, 114 through a directory that includes the message digest, or a pointer to the data location in the stub file, for example.

If it is determined that the data block location is in the cache 148, in Step 406, the data block is retrieved, in Step 408, and provided to the requesting client machine 110, in Step 410, by the interface 172 over the network 120. The metadata 208 associated with the hash value of the data block in the hash table 200 may be updated to indicate the date/time of the current request.

If it is determined that the requested data block is not in the cache 148, in Step 406, then it is determined whether the data block is in the primary storage 146 in Step 412.

If it is determined in Step 412 that the data block is present in the primary storage 146, for example, the block is retrieved by the processor 142, in Step 414, and provided to the requesting client machine 110 via the interface 172 and the network 120, in Step 410.

In this example of a FIFO method, it is assumed that the most recently requested data block is most likely to be requested again. The processor 142 therefore attempts to transfer the data block to the cache 148, in Step 416, by proceeding to Step 316 of the method 300 to determine whether there is room in the cache. The method 300 then proceeds through Steps 316-326, as discussed above, by the processor 142, for example.

If it is determined that the data block is not in the primary storage 146 in Step 412, then an error has occurred and a read error message is provided to the client machine, in Step 418.
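The retrieval flow of the method 400 can be summarized in a Python sketch. Plain dictionaries stand in for the hash table, the cache, and the primary storage; `retrieve_block`, `ReadError`, and the block-count `cache_capacity` (assumed to be at least 1) are illustrative names and simplifications, not elements of the described system.

```python
class ReadError(Exception):
    """Raised when a requested block is in neither storage (cf. Step 418)."""


def retrieve_block(digest, hash_table, cache, primary, cache_capacity):
    """Return the requested block, moving it into the cache on a cache miss."""
    if digest in cache:                    # Step 406: block is in the cache
        return cache[digest]               # Steps 408 and 410
    if digest in primary:                  # Step 412: block is in slow storage
        block = primary.pop(digest)        # Step 414
        if len(cache) >= cache_capacity:   # cache full: make room, FIFO order
            oldest = next(iter(cache))     # dicts preserve insertion order
            primary[oldest] = cache.pop(oldest)
            hash_table[oldest] = "primary"
        cache[digest] = block              # Step 416: move block to the cache
        hash_table[digest] = "cache"
        return block
    raise ReadError(f"block {digest!r} not found")


hash_table = {"a": "cache", "b": "cache", "c": "primary"}
cache = {"a": b"1", "b": b"2"}
primary = {"c": b"3"}
result = retrieve_block("c", hash_table, cache, primary, cache_capacity=2)
```

After the call, block "c" resides in the cache and the oldest cached block "a" has been moved to the slower storage, so a repeated request for "c" is served from the faster device.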

It will be appreciated by those skilled in the art that changes may be made to the embodiments described herein, without departing from the spirit and scope of the inventions, which are defined by the following claims.

I claim:
 1. A method for retrieving deduplicated data from a deduplication system comprising a first storage device and a second storage device for storing deduplicated data, wherein data is retrievable from the first storage device faster than data is retrievable from the second storage device, the method comprising: receiving a request from a client machine for deduplicated data; determining a location of the requested deduplicated data; and if the data is in the second storage device: retrieving the data; providing the retrieved data to the client machine; and moving the retrieved data to the first storage device.
 2. The method of claim 1, wherein the request for deduplicated data includes a message digest of the deduplicated data, the method further comprising: determining the location of the requested deduplicated data based, at least in part, on the digest.
 3. The method of claim 2, comprising determining the location of the requested deduplicated data by checking a table correlating message digests with locations of data corresponding to the message digest on the first and second storage devices.
 4. The method of claim 1, further comprising, prior to moving the retrieved data to the first storage device: determining whether the first storage device is full; and if the first storage device is full, moving a data block from the first storage device to the second storage device prior to moving the retrieved data to the first storage device.
 5. The method of claim 4, comprising selecting a data block to move to the second storage device based, at least in part, on: how long a data block has been in the first storage device; how long a data block has been in the first storage device without being requested; a number of past requests for each data block; a probability that a respective data block will be requested; types of requests for respective data blocks; a number of times respective data blocks have been received for deduplication from more than one client machine; a location of respective data blocks on a client machine providing the data of the data block for deduplication; a number of locations of the same data block on the client machine providing the data of the data block for deduplication; and an identity of the client machine providing the data of the data block for deduplication.
 6. The method of claim 5, comprising selecting the data block by applying an algorithm including respective weightings for the one or more of the factors.
 7. The method of claim 1, wherein: the first storage device comprises a non-spinning storage device; and the second storage device comprises a spinning storage device.
 8. The method of claim 7, wherein the non-spinning storage device comprises a solid state drive storage device.
 9. The method of claim 1, wherein the deduplicated data was previously received from the client machine, by the deduplication system, the method further comprising: deduplicating the data; determining whether to store the received data in the first storage device or the second storage device; and storing the data in the determined first storage device or second storage device.
 10. A deduplication system comprising: a first storage device; a second storage device, wherein data is retrievable from the first storage device faster than data is retrievable from the second storage device; at least one processing device configured to: receive a request from a client machine for deduplicated data; determine a location of the requested deduplicated data; and if the data is in the second storage device: retrieve the data; provide the retrieved data to the client machine; and move the retrieved data to the first storage device.
 11. The system of claim 10, wherein: the first storage device comprises a non-spinning storage device; and the second storage device comprises a spinning storage device.
 12. The system of claim 11, wherein the first storage device comprises a solid state storage device.
 13. The system of claim 11, wherein the request for deduplicated data comprises a message digest of the deduplicated data and the at least one processing device is configured to determine the location by: checking a table correlating message digests with locations of data with the first and second storage devices.
 14. The system of claim 11, wherein the at least one processing device is further configured to: determine whether the first storage device is full; and if the first storage device is full, move a second data block from the first storage device to the second storage device prior to moving the retrieved data block to the first storage device.
 15. The system of claim 14, wherein the at least one processing device is further configured to select the data block to move to the second storage device based, at least in part, on two or more of the following factors: how long a data block has been in the first storage device; how long a data block has been in the first storage device without being requested; a number of past requests for each data block; a probability that a respective data block will be requested; a number of times respective data blocks have been received for deduplication; types of requests for respective data blocks in a predetermined time period; a location of respective data blocks in the client machine providing the data block for deduplication; a number of locations of the data on the client machine; and/or an identity of the client machine providing the data of the data block for deduplication.
 16. The system of claim 15, wherein the at least one processing device is configured to select the data block by applying an algorithm including respective weightings for the one or more of the factors.
 17. The system of claim 11, wherein the data was previously received by the at least one processing device for deduplication, the at least one processing device being further configured to: deduplicate the data; determine whether to store the data in the first storage device or the second storage device; and store the data in the determined first storage device or second storage device.